Conversation
Greptile Summary

This PR introduces a parallel fixture execution framework that reduces overall test time by parallelizing fixture setup and teardown operations.

The architecture is well documented and addresses real concurrency challenges in the SONiC test infrastructure. However, several logical issues need attention; see the review comments below.
Confidence Score: 3/5
| Filename | Overview |
|---|---|
| tests/common/devices/base.py | Fixed closure bug where module_name was captured incorrectly by returning a wrapper function |
| tests/common/devices/duthosts.py | Fixed closure bug and added explicit initialization of _nodes attributes to prevent __getattr__ recursion |
| tests/common/devices/multi_asic.py | Fixed closure bug where multi_asic_attr was captured incorrectly by returning a wrapper function |
| tests/common/helpers/parallel.py | Added deadlock prevention for fork operations with logging locks, patched Ansible worker processes |
| tests/common/plugins/parallel_fixture/__init__.py | Implemented parallel fixture execution framework with thread pool, scope-based barriers, and exception handling |
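Three of the rows above fix the same class of bug: a closure capturing a loop variable late, so every generated function sees the loop's final value. A minimal illustration with generic names (not the actual sonic-mgmt code):

```python
# Late-binding closure bug: all lambdas in the loop share one `name`
# binding, so each returns whatever `name` held when the loop finished.
def make_broken(names):
    funcs = []
    for name in names:
        funcs.append(lambda: name)  # captures the variable, not its value
    return funcs

# Fix: return the lambda from a wrapper function, which freezes the
# value per call -- the approach the table rows describe.
def make_fixed(names):
    def wrapper(name):
        return lambda: name
    return [wrapper(name) for name in names]

broken = make_broken(["a", "b", "c"])
fixed = make_fixed(["a", "b", "c"])
assert [f() for f in broken] == ["c", "c", "c"]
assert [f() for f in fixed] == ["a", "b", "c"]
```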
Sequence Diagram
```mermaid
sequenceDiagram
    participant Pytest
    participant ParallelManager
    participant ThreadPool
    participant MonitorThread
    participant Fixture
    participant Barrier

    Note over Pytest,Barrier: Session Setup Phase
    Pytest->>ParallelManager: Create (session scope)
    ParallelManager->>ThreadPool: Initialize worker threads
    ParallelManager->>MonitorThread: Start monitoring
    Pytest->>Fixture: Execute fixture (session scope)
    Fixture->>ParallelManager: submit_setup_task(SESSION, func)
    ParallelManager->>ThreadPool: Submit task to thread pool
    ThreadPool-->>ParallelManager: Return future
    Fixture-->>Pytest: Yield immediately
    Pytest->>Barrier: setup_barrier_session
    Barrier->>ParallelManager: wait_for_setup_tasks(SESSION)
    ParallelManager->>ThreadPool: Wait for all session tasks
    ThreadPool-->>ParallelManager: Tasks complete

    Note over Pytest,Barrier: Module/Class/Function Scopes
    Pytest->>Fixture: Execute fixture (module scope)
    Fixture->>ParallelManager: submit_setup_task(MODULE, func)
    ParallelManager->>ThreadPool: Submit to pool
    Fixture-->>Pytest: Yield
    Pytest->>Barrier: setup_barrier_module
    Barrier->>ParallelManager: wait_for_setup_tasks(MODULE)
    ParallelManager->>ThreadPool: Wait for module tasks

    Note over Pytest,Barrier: Test Execution
    Pytest->>ParallelManager: pytest_runtest_call hook
    ParallelManager->>ThreadPool: Ensure all tasks complete
    ParallelManager->>ThreadPool: Terminate executor
    Pytest->>Pytest: Run test function

    Note over Pytest,Barrier: Teardown Phase
    Pytest->>ParallelManager: pytest_runtest_teardown hook
    ParallelManager->>ThreadPool: Reset and create new executor
    ParallelManager->>MonitorThread: Restart monitoring
    Pytest->>Fixture: Teardown fixture
    Fixture->>ParallelManager: submit_teardown_task(scope, func)
    ParallelManager->>ThreadPool: Submit teardown task
    Fixture-->>Pytest: Return
    Pytest->>Barrier: teardown_barrier_function
    Barrier->>ParallelManager: wait_for_teardown_tasks(FUNCTION)
    ParallelManager->>ThreadPool: Wait for function teardowns
    Pytest->>ParallelManager: pytest_runtest_logreport hook
    ParallelManager->>ThreadPool: Terminate executor
    ParallelManager->>MonitorThread: Stop monitoring
```
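The setup flow in the diagram — fixtures submitting work and yielding immediately, with a scope barrier joining the futures before tests run — can be sketched roughly as follows. Names such as `ParallelManager` and `submit_setup_task` follow the diagram; the real plugin's implementation (monitoring thread, exception injection, hooks) is more involved.

```python
from concurrent.futures import ThreadPoolExecutor

class ParallelManager:
    """Minimal sketch of the submit-then-barrier pattern from the diagram."""

    def __init__(self, worker_count=4):
        self.executor = ThreadPoolExecutor(max_workers=worker_count)
        self.futures = {}  # scope name -> list of pending futures

    def submit_setup_task(self, scope, func, *args):
        # A fixture submits its setup work here and yields immediately.
        self.futures.setdefault(scope, []).append(
            self.executor.submit(func, *args))

    def wait_for_setup_tasks(self, scope):
        # The barrier fixture blocks until every task in this scope is
        # done; future.result() re-raises any exception a task produced.
        for future in self.futures.pop(scope, []):
            future.result()

manager = ParallelManager()
results = []
manager.submit_setup_task("SESSION", results.append, "dut-1 ready")
manager.submit_setup_task("SESSION", results.append, "dut-2 ready")
manager.wait_for_setup_tasks("SESSION")  # plays the role of setup_barrier_session
assert sorted(results) == ["dut-1 ready", "dut-2 ready"]
```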
`tests/common/plugins/parallel_fixture/__init__.py`, lines 38-44:

```python
def raise_async_exception(tid, exc_type):
    """Injects an exception into the specified thread."""
    if not isinstance(tid, int):
        raise TypeError("Thread ID must be an integer")

    ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid),
                                               ctypes.py_object(exc_type))
```
The PyThreadState_SetAsyncExc call doesn't check the return value. According to Python docs, it returns the number of thread states modified. If it returns 0, the thread ID is invalid. If it returns >1, multiple thread states were affected (shouldn't happen but indicates a serious issue).
Suggested fix:

```python
def raise_async_exception(tid, exc_type):
    """Injects an exception into the specified thread."""
    if not isinstance(tid, int):
        raise TypeError("Thread ID must be an integer")
    ret = ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid),
                                                     ctypes.py_object(exc_type))
    if ret == 0:
        logging.warning("Thread ID %s not found when injecting exception", tid)
    elif ret > 1:
        # Revert if we affected multiple threads
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid), None)
        raise SystemError(f"PyThreadState_SetAsyncExc affected {ret} threads")
```
`tests/common/plugins/parallel_fixture/__init__.py`, line 266:

```python
try:
    return func(*args, **kwargs)
except Exception:
    raise sys.exc_info()[0](traceback.format_exc())
```
The exception re-raising logic reconstructs a new exception from `sys.exc_info()[0]` with the formatted traceback as its message, which discards the original exception's arguments and breaks the exception chain. This makes debugging harder and does not properly preserve the exception context; a bare `raise` keeps the original exception and traceback intact.
Suggested fix:

```python
raise
```
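For context, a bare `raise` inside an `except` block re-raises the original exception object with its type, arguments, and traceback intact, which is what the suggestion relies on. A small standalone illustration (the function names here are made up):

```python
def flaky_fixture():
    raise ValueError("setup failed")

def run_in_worker(func):
    try:
        return func()
    except Exception:
        # Bare raise preserves the original type, args, and traceback,
        # unlike reconstructing via sys.exc_info()[0](traceback.format_exc()),
        # which yields a new exception whose only "argument" is a big string.
        raise

try:
    run_in_worker(flaky_fixture)
except ValueError as exc:
    assert exc.args == ("setup failed",)   # original arguments survive
    assert exc.__traceback__ is not None   # traceback survives too
```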
`tests/common/plugins/parallel_fixture/__init__.py`, lines 393-394:

```python
self.active_futures.clear()
self.done_futures.clear()
```
Clearing the futures before termination check creates a race condition. If terminate() hasn't been called yet, clearing these sets will lose track of active tasks. The terminated flag should be checked before clearing.
Suggested fix:

```python
if not self.terminated:
    raise RuntimeError("Cannot reset a running parallel fixture manager.")
self.active_futures.clear()
self.done_futures.clear()
```
`tests/common/plugins/parallel_fixture/__init__.py`, lines 388-398:

```python
def reset(self):
    """Reset the parallel fixture manager."""
    if not self.terminated:
        raise RuntimeError("Cannot reset a running parallel fixture manager.")

    self.active_futures.clear()
    self.done_futures.clear()
    self.executor = ThreadPoolExecutor(max_workers=self.worker_count)
    self.is_monitor_running = True
    self.monitor_thread = threading.Thread(target=self._monitor_workers, daemon=True)
    self.monitor_thread.start()
```
reset() doesn't set self.terminated = False after resetting the executor. This means once terminated, the manager can never be used again because terminate() will immediately return at line 335 due to the terminated check. Either reset should set terminated = False, or the design should clarify if reset is only for internal reuse.
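One concise fix is to clear the flag once the new executor is in place, so a `terminate()`/`reset()` cycle can repeat. The sketch below uses assumed attribute names mirroring the reviewed class (monitor thread omitted), not the plugin's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

class Manager:
    """Sketch: reset() re-arms the manager by clearing `terminated`."""

    def __init__(self, worker_count=4):
        self.worker_count = worker_count
        self.terminated = False
        self.executor = ThreadPoolExecutor(max_workers=worker_count)

    def terminate(self):
        if self.terminated:       # the early-return guard the comment mentions
            return
        self.executor.shutdown(wait=True)
        self.terminated = True

    def reset(self):
        if not self.terminated:
            raise RuntimeError("Cannot reset a running parallel fixture manager.")
        self.executor = ThreadPoolExecutor(max_workers=self.worker_count)
        self.terminated = False   # without this, terminate() is a no-op forever

m = Manager()
m.terminate()
m.reset()
assert m.terminated is False
assert m.executor.submit(lambda: 42).result() == 42  # usable again after reset
```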
Replace shell-based openssl certificate generation with the Python cryptography library. Certificates are now backdated by 1 day to handle clock skew between the test host, DUT, and PTF container. This fixes the error:

"tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time ... is before ..."

Changes:
- Add tests/common/cert_utils.py with TlsCertificateGenerator class
- Add create_gnmi_cert_generator() factory for gNMI/gNOI naming
- Update grpc_fixtures.py to use Python-based cert generation
- Remove localhost fixture dependency from setup_gnoi_tls_server

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
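The backdating approach this commit describes can be sketched with the `cryptography` package roughly as below. This is an illustrative sketch, not the actual TlsCertificateGenerator implementation; the common name is made up.

```python
import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "test.gnmi.local")])
now = datetime.datetime.utcnow()
cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)                                    # self-signed
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now - datetime.timedelta(days=1))   # backdate for clock skew
    .not_valid_after(now + datetime.timedelta(days=365))
    .sign(key, hashes.SHA256())
)
# The validity window now starts a day in the past, so a DUT or PTF whose
# clock is slightly behind the test host will still accept the certificate.
assert (now - cert.not_valid_before) >= datetime.timedelta(hours=23)
```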
What is the motivation for this PR?
Test DUTs with 64 ports connected to peers.

How did you verify/test it?
Prepared the DUT and verified bgp/test_bgp_fact passes.

* Updated with latest tested topology
* Fix end-of-file static checker
* Change neighbor VM name from T1 to LT2

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>
…sonic-net#21990)

What is the motivation for this PR?
Running many individual SSH commands in test scripts resulted in excessive overhead due to repeated SSH session establishments, leading to unnecessarily long test times. The total test time was 1 hr 26 mins: the current implementation makes one SSH call for each of 450 ports and each of its 7 UC queues separately, roughly 3150 SSH commands, and about 1 hr 20 mins of the run is spent fetching this data. The motivation is to improve test efficiency and make large test runs more practical by reducing SSH connection overhead.

How did you do it?
Identified places in the test where SSH commands are executed in sequence or within loops, and rewrote those sections to batch multiple commands together and execute them through a single SSH session, fetching show queue counters for all ports and queues together. Command outputs and error handling remain compatible with existing test validations.

How did you verify/test it?
Ran the affected test scripts before and after the changes to compare results and execution time. Verified that all expected outputs and pass/fail criteria remain the same, and monitored for regressions, failures, or unexpected behaviors possibly introduced by batching commands. The execution time dropped from 1 hr 26 mins to 3 mins.

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
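The batching idea above — one remote invocation for all ports and queues, parsed locally — can be sketched like this. `run_ssh` stands in for whatever executes a command on the DUT; it and the output format are assumptions for illustration, not the real sonic-mgmt API.

```python
def batched_queue_counters(ports, run_ssh):
    """Fetch queue counters for all ports in ONE remote call, instead of
    one SSH session per (port, queue) pair, then parse the output locally."""
    output = run_ssh("show queue counters")
    counters = {}
    for line in output.splitlines():
        fields = line.split()
        # Hypothetical row shape: <port> <queue> <count>
        if len(fields) >= 3 and fields[0] in ports:
            port, queue, count = fields[0], fields[1], fields[2]
            counters.setdefault(port, {})[queue] = count
    return counters

# Simulated DUT output; a real run_ssh would open a single SSH session.
fake_output = "Ethernet0 UC0 17\nEthernet0 UC1 4\nEthernet4 UC0 0\n"
result = batched_queue_counters({"Ethernet0", "Ethernet4"}, lambda cmd: fake_output)
assert result["Ethernet0"]["UC1"] == "4"
assert result["Ethernet4"] == {"UC0": "0"}
```

The per-call savings compound: the cost becomes one session setup plus local parsing, instead of thousands of session setups.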
) Summary: Replace fixed sleeps with polling and reduce wait times:
- Add polling helpers: wait_for_crm_counter_update(), wait_for_resource_stabilization()
- Replace 50s resource waits with adaptive polling
- Reduce config waits from 10s to 5s (CONFIG_UPDATE_TIME)
- Reduce cleanup wait from 50s to 20s (SONIC_RES_CLEANUP_UPDATE_TIME)

Signed-off-by: AntonHryshchuk <antonh@nvidia.com>
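Replacing fixed sleeps with adaptive polling generally follows the `wait_until` idiom common in sonic-mgmt tests; a self-contained sketch (standard library only, not the helpers named above):

```python
import time

def wait_until(timeout, interval, condition, *args):
    """Poll `condition(*args)` until it returns True or `timeout` seconds
    elapse. Returns as soon as the condition holds, so a resource that
    stabilizes in 3s no longer costs a fixed 50s sleep."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False

# Example: a "counter" that becomes ready after a few polls.
polls = []
ready = wait_until(1.0, 0.01, lambda: polls.append(1) or len(polls) >= 3)
assert ready is True
assert len(polls) == 3  # returned immediately on the third poll
```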
…le topos. (sonic-net#22169)

What is the motivation for this PR?
The generic hash test run time depends on the number of egress ports in the topology. When running on a scale topology like t0-isolated-d256u256s2, it takes more than 6 hours. This change optimizes the runtime; regular topologies also benefit.

How did you do it?
- Decrease the total number of sent packets based on the number of egress ports, and correspondingly relax the threshold for the balancing check.
- Skip the lag-related cases on topologies which have no lags.
- Skip some test cases when there are not enough test ports or too few variations in the tested field. For example, when the hash field is IP_PROTOCOL there are only 256 values in total, which is not enough to hash to all 256 next hops perfectly.
- Improve the test result output.
- Add an ASIC DB route check in the reboot test case. After reboot/reload it takes time for the default route to be installed in the ASIC, and checking only the kernel route is not enough on scale topos.

How did you verify/test it?
Ran the test on SN5640 with the t0-isolated-d256u256s2 topo. The total runtime is about 1.5 ~ 2 hours now.

Signed-off-by: Cong Hou <congh@nvidia.com>
…onic-net#22287) Description of PR Changed container_upgrade tests to support restapi containers (restapi, restapi_watchdog, and restapi_sidecar). Summary: Microsoft ADO ID: 36672179 Approach What is the motivation for this PR? Adding support for restapi containers in container_upgrade tests so that we can run restapi tests using the same container images on different OS versions. How did you do it? Added instructions for creating restapi containers and running restapi tests in container_upgrade. How did you verify/test it? Tested on a T1 switch running 202505. The restapi tests passed.
What is the motivation for this PR? Fix test_bgp_suppress_fib.py flakiness on scale topos. Extension of BULK_TRAFFIC_WAIT_TIME and ignoring malformed BGP route packets allows these tests to pass consistently on Arista t1 isolated DUTs How did you do it? How did you verify/test it? Tested on T0 and T1 isolated Arista DUTs
* Fix AttributeError in conftest.py In the ptfhosts pytest fixture in conftest.py, a method "_hosts.apend" appears, which is a typo corrected by this change to "_hosts.append". Signed-off-by: Christopher Croy <ccroy@arista.com> * Fix flake8 lint indentation error Signed-off-by: Christopher Croy <ccroy@arista.com> --------- Signed-off-by: Christopher Croy <ccroy@arista.com>
…ic-net#22201) orch_northbond_route_zmq_enabled was enabled in: sonic-net#20441 which enables fpmsyncd to send route events to orchagent via a ZMQ channel instead of Redis. test_bgp_route_with_suppress_negative_operation kills orchagent and restarts bgp session. After restart, fpmsyncd tries to send all updates to orchagent via ZMQ, but orchagent is still down. Eventually, ZMQ buffer is filled up and causes the error. Ignore this error log as it is expected. Signed-off-by: markxiao <markxiao@arista.com>
Signed-off-by: Austin Pham <austinpham@microsoft.com>
…faces data for the test
…ication to handle IPV6 only topologies (sonic-net#21863)
…c-net#21561) What is the motivation for this PR? Expand the current NTP test case to verify that changing the source interface configuration for NTP works correctly. How did you do it? Add a test case to change the NTP source interface to Loopback0, and then back to eth0. How did you verify/test it? Tested on T0 KVM. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
What is the motivation for this PR? Add a test case to verify the ability to add and remove DNS nameservers via incremental config. How did you do it? Model this after the NTP GCU test case, and add two random DNS nameservers and remove one. How did you verify/test it? Tested locally on KVM T0. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…System (sonic-net#20872) What is the motivation for this PR? sonic-mgmt tests need to be improved for FS voq How did you do it? access to Chassis DB is done only if the voq is a chassis skipped test_po_voq.py for single ASIC FS as this test checks for additons/deletions/updates of LAGs across all asics in a chassis using chassis APP DB How did you verify/test it? ran the tests on FS voq Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai> Co-authored-by: Xin Wang <xiwang5@microsoft.com>
* ISS-2888: Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)
  Fixes a JSON syntax error seen only when the DUT is prepared with the macsec enable flag:
  json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4)

* ISS-2969: Generate golden config only if macsec_profile is defined (sonic-net#420)
  Avoids redundant override config, since no macsec profile is set in the prepare phase. How macsec profile configurations are rendered:
  - PREPARE phase: generate_t2_golden_config_db() → template rendering → file-based config
  - RUN phase: set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)
  MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit after repeated restarts during teardown/restart cycles. New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py:
  - Proactively clears systemd failure counters (systemctl reset-failed)
  - Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start
  - Verifies the target container becomes running within a timeout
  tests/macsec/test_docker_restart.py now uses the helper instead of duthost.restart_service("macsec"). Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. Verified by repeatedly running tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled to emulate rate-limiting scenarios.

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)
  NOS-3311 tracks MACsec test flakiness caused by races between wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB) and the test harness eagerly reading that state to build MACSEC_INFO (via get_macsec_attr). This can manifest as exceptions like KeyError('sak') when the MACsec egress SA row does not yet exist, even though MACSEC_PORT_TABLE already shows enable_encrypt="true". There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup has finished. MACsec control-plane actions are asynchronous relative to the tests, and get_macsec_attr() assumes the APP_DB MACSEC_EGRESS_SC_TABLE entry for (port, sci) exists with a valid encoding_an, and that the MACSEC_EGRESS_SA_TABLE entry for (port, sci, an) exists with a sak field. Two pieces of synchronization are added:
  1. Ensure MKA establishment before pre-loading MACsec session info (tests/common/macsec/__init__.py): the module-scoped, autouse load_macsec_info fixture now first attempts request.getfixturevalue('wait_mka_establish') — which uses check_appl_db plus wait_until(...) to ensure the APP/STATE DB MACsec SC/SA tables (including sak/auth_key/PN relationships) are populated — before calling load_all_macsec_info(). If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior; otherwise the original macsec_setup flow is triggered when MACsec is not yet configured.
  2. New helper wait_for_macsec_cleanup(host, interfaces, timeout=90) in tests/common/macsec/macsec_config_helper.py, exported via __all__, for tests that disable MACsec and then need to assert that all associated MACsec entries (port, SC, SA) have been removed from Redis. For EOS neighbors it is a no-op returning True, since they do not use Redis DBs. For SONiC hosts it polls APPL_DB and STATE_DB via redis_get_keys_all_asics with patterns MACSEC_*:{interface}* (APPL_DB) and MACSEC_*|{interface}* (STATE_DB), returns True as soon as all such keys are gone (logging total time taken), and on timeout logs a warning with a summary of remaining entries and returns False. This centralizes "wait until MACsec entries are gone from Redis" instead of ad hoc sleeps in individual tests.
  Verified on a MACsec-enabled testbed (e.g., humm120) that sessions establish and APP/STATE DB contain the expected entries before load_all_macsec_info is invoked, and that disabling MACsec followed by wait_for_macsec_cleanup removes all MACSEC_* keys from APPL/STATE DB within the timeout window.
  Pull Request opened by Augment Code (https://www.augmentcode.com/) with guidance from the PR author.

* Review follow-ups: refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited; narrow the exception in MacsecPlugin to pytest.FixtureLookupError so the fallback only triggers when the wait_mka_establish fixture is truly missing; make wait_for_macsec_cleanup use a dynamic poll interval and rely on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
…r handling (sonic-net#22082)

Summary: Fixes ethernet interface test failures due to port availability issues and routeCheck errors on multi-ASIC platforms. The following PR is also needed to fix the test completely: sonic-net#21026 (FEC-related issue).

What is the motivation for this PR?
Ethernet interface tests would fail when all ethernet ports were members of PortChannels and no free ports were available for testing. Additionally, tests were failing during teardown with routeCheck errors, particularly on multi-ASIC platforms, where port configuration changes (speed, FEC, admin status) triggered temporary routing inconsistencies that were flagged by the routeCheck monitoring service.

How did you do it?
- Enhanced port availability logic: added get_test_port() to automatically remove a port from its PortChannel if no free ports are available, plus a remove_port_from_portchannel() helper. The removal is temporary; the ensure_dut_readiness fixture's rollback mechanism restores the original configuration.
- Added MTU test protection: checks for free ports without removing them from the PortChannel (remove_from_portchannel=False) and skips the test if none are available, to avoid routing convergence issues.
- Added a loganalyzer fixture (ignore_expected_loganalyzer_exceptions_lag) to suppress expected routeCheck errors during port configuration changes: orchagent port speed errors and temporary routeCheck failures with unaccounted routes.

How did you verify/test it?
Ran all ethernet interface tests on multi-ASIC linecard platforms where all ports were in PortChannels. Verified that tests complete without routeCheck failures during teardown, and that port removal and restoration work correctly via the rollback mechanism.

* Trigger CI checks
* Updated conftest, and removed test-based loganalyzer

Signed-off-by: Nanma Purushotam <gupurush@gmail.com>
Fix incorrect fallback value for the serial device prefix in _get_serial_device_prefix(): changed from /dev/ttyUSB- to /dev/ttyUSB to generate correct device paths (e.g., /dev/ttyUSB1 instead of /dev/ttyUSB-1).

What is the motivation for this PR?
When udevprefix.conf cannot be read, the fallback value /dev/ttyUSB- generates invalid device paths.

How did you do it?
Changed the fallback from /dev/ttyUSB- to /dev/ttyUSB and updated related docstrings/comments.

How did you verify/test it?
Code review.

Signed-off-by: cliffchen <t-cliffchen+github@microsoft.com>
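The bug is purely in how the device path is concatenated from prefix and index; a tiny sketch (the helper name is made up, not the fixed function):

```python
def serial_device_path(prefix, index):
    # Device path is just prefix + numeric index, e.g. /dev/ttyUSB1.
    return f"{prefix}{index}"

# Correct fallback prefix yields a valid device path...
assert serial_device_path("/dev/ttyUSB", 1) == "/dev/ttyUSB1"
# ...while the old fallback's trailing dash produced a nonexistent device.
assert serial_device_path("/dev/ttyUSB-", 1) == "/dev/ttyUSB-1"
```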
What is the motivation for this PR? Enable test_po_update with ipv6-only topo How did you do it? Added steps for ipv6 How did you verify/test it? Ran on ipv6-only topo testbed
What is the motivation for this PR? To enable ipv6 only topo for test_static_route How did you do it? Have a check for ipv6 and use appropriate prefix for it. How did you verify/test it? Tested on ipv6 only testbed to verify
…pv6-default] since it is duplicated with skip condition (sonic-net#21759)

What is the motivation for this PR?
Remove the duplicated xfail, since it is already covered by the skip condition. This makes the conditions clearer, with less confusion and less maintenance.
What is the motivation for this PR? The previous test_ipinip teardown could leave the dut in an unhealthy state after the test finishes, causing several docker containers to remain down and impacting subsequent tests. How did you do it? Updated the test_ipinip teardown logic to restore the dut by performing a golden config reload. How did you verify/test it? Ran test_ipinip and passed.
DPU config can be deployed after the switch configuration.
Minigraph deployment command has been extended:
Deploy minigraph for both Switch and DPU:
./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv
Deploy minigraph for Switch only:
./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv --skip-tags "dpu_config"
Deploy minigraph for DPU only:
./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv --tags "dpu_config"
Signed-off-by: nmirin <nikolay.a.mirin@gmail.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>